One of the best ways to understand the data you’re working with is to visualize it through charts and graphs. In R, there are many pre-installed commands that help you do this! The two most common ways to create these types of visualizations are the use of the plot() and hist() functions.

Before we put these to use, though, let’s import some data. In the following examples I am going to be using data from IPUMS on the proportion of adults in each state whose income falls below $20,000 for the year 2015. This is a proportion because it ranges from 0 to 1. A percentage varies from 0 to 100.

#Reading .csv file into R (read.csv() is part of base R, so no extra package needs to be loaded)
example.data <- read.csv(file = "~/Dropbox/309/RHelp/example.data.csv")
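Once a file is read in, it is worth checking that it loaded the way you expect. The sketch below is only an illustration: it writes a tiny made-up data frame to a temporary .csv file (the values, the toy names, and the file are all invented here and are not the IPUMS data), reads it back with read.csv(), and then inspects it with head(), str(), and summary().

```r
# A tiny made-up data frame standing in for example.data (values are invented)
toy <- data.frame(state = c("A", "B", "C"),
                  income.20 = c(0.15, 0.22, 0.18))

# Write it to a temporary file, then read it back the same way as above
toy.path <- tempfile(fileext = ".csv")
write.csv(toy, file = toy.path, row.names = FALSE)
toy.data <- read.csv(file = toy.path)

# Quick checks: first rows, column types, and a numeric summary
head(toy.data)
str(toy.data)
summary(toy.data$income.20)
```

The same three checks can be run on example.data right after you import it.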

Histograms

Now that our data is imported into R, let’s try to visualize it to get a better understanding of what it contains. One of the easiest ways to do this is through a simple histogram. Histograms show how frequently a variable takes particular values, so they are a very useful tool for understanding a variable’s distribution. Let’s see what the most common proportion of adults whose income falls below $20,000 per year is using the hist() function. This function only requires one argument to run: the variable you wish to plot.

The variable for this data is labeled as income.20 within the example.data dataset, so it can be accessed as example.data$income.20. The data in this example are measured at the state level: there is one row in the dataset for each state (plus one for D.C.).

#Place the data into your function
hist(example.data$income.20)

We have a histogram! The x-axis contains the proportion of those whose income falls below $20,000 per year and the y-axis contains the frequency (number of states that fall between two proportions: a bin). This is hard to understand though – the main title and the label of the x-axis don’t really tell us what we’re looking at. Let’s change that.
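If you want the numbers behind the bars, hist() also returns them. The following is a minimal sketch using simulated proportions (sim.income is made up for illustration and is not the real income.20 variable): setting plot = FALSE skips the drawing and just hands back the bin edges and counts.

```r
# Simulated proportions standing in for example.data$income.20 (not real data)
set.seed(1)
sim.income <- runif(51, min = 0.10, max = 0.30)

# plot = FALSE returns the histogram's numbers without drawing it
h <- hist(sim.income, plot = FALSE)

h$breaks   # the edges of each bin
h$counts   # how many values fall in each bin

# Midpoint of the bin (or bins, if tied) with the highest count
h$mids[h$counts == max(h$counts)]
```

There is always one more break than there are bins, because each bin sits between two neighboring breaks.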

Adding X-Axis Label to Histograms

As I said before, the hist() function only requires one argument to run, but many more can be placed within it to change what R outputs. To change the label of the x-axis, the argument xlab is used. Assign text to this argument within the hist() function to change the label! (Make sure that the text is contained within double quotes as shown in the code below.)

Hint: If you’re ever stuck, and don’t know what to put in a function to get it to do what you want it to do, use the help() function. Within your console, place whatever function you’re attempting to use within the help() parentheses. R will tell you what the function does and what its arguments are. For example: help(hist) will give you a detailed explanation of the histogram function and its arguments.
#Here I add a label to the x-axis
hist(example.data$income.20, xlab = "Proportion of People Earning Below $20,000")

Now we have a better understanding of what is being displayed on the x-axis. The y-axis label does not need to be changed in this example because it already accurately describes what it’s displaying.

Adding a Title

It’s easy to change the main title of the graph as well! This is done in the same way that a label was added to the x-axis with xlab, except that it uses the main argument. As you add more and more arguments to the function, be sure to separate them with commas!

#Adding a title to the histogram
hist(example.data$income.20, 
     xlab = "Proportion of People Earning Below $20,000", 
     main = "Histogram of People Living in Poverty in 2015")

Changing Histogram Breaks

Whenever you plot a histogram, R automatically sets the number of bins (bars on the histogram) on the graph based on the data it is plotting. In the above examples, R plotted 10 bins. It is possible to change this characteristic of the graph as well. To change the number of bins in a histogram, we use the breaks argument in the hist() function. Assigning a number to this argument will change the number of bins on the histogram.

Here is an example applying 6 to breaks:

hist(example.data$income.20,
     breaks = 6,
     xlab = "Proportion of People Earning Below $20,000", 
     main = "Histogram of People Living in Poverty in 2015")

Notice that it didn’t change? When you give breaks a single number, R treats it only as a suggestion: it picks round, evenly spaced break points that cover the data, so the number of bins won’t always match the number you put in. That being said, it will generally give you what you want. If you apply a significantly higher number, the number of bins will increase, and a significantly smaller number will decrease it. Here is a better example where I apply 12 to breaks:

hist(example.data$income.20,
     breaks = 12,
     xlab = "Proportion of People Earning Below $20,000", 
     main = "Histogram of People Living in Poverty in 2015")

While R didn’t give me exactly 12 bins, it did give me a much more spaced out histogram. The best thing to do with this argument is to play around with the numbers until you like the way the histogram looks. But, be sure the number of breaks you use doesn’t give an inaccurate representation of the data!
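If you need an exact number of bins, one option is to give breaks a vector of break points instead of a single number. The sketch below uses made-up, evenly spaced values (sim.income and exact.breaks are names invented for this illustration) and compares the two approaches without drawing anything.

```r
# Evenly spaced made-up proportions between 0.11 and 0.29 (not real data)
sim.income <- seq(0.11, 0.29, length.out = 51)

# A single number is only a suggestion, so the result may not have 6 bins
h.suggested <- hist(sim.income, breaks = 6, plot = FALSE)
length(h.suggested$counts)

# A vector of break points is used exactly: 9 edges make 8 bins
exact.breaks <- seq(0.10, 0.30, length.out = 9)
h.exact <- hist(sim.income, breaks = exact.breaks, plot = FALSE)
length(h.exact$counts)
```

When you supply your own break points, they have to cover the whole range of the data, or R will stop with an error.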

Adding Color to Histogram

Changing the colors of the bins on the histogram is very easy to do as well! To do this we’ll use the col argument. There are quite literally hundreds of named colors you can use in R; you just have to pick the one you want.

You can see all of the possible color names by running colors() in the console. For this example, say we want to change the color to “firebrick”. You must place this text within quotation marks or R will not understand what you’re trying to do.
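To browse the color names from inside R, a short sketch like the one below may help. It relies only on the built-in colors() function; all.colors is just a name I picked for the result.

```r
# All of R's built-in color names, as a character vector
all.colors <- colors()
length(all.colors)

# Check that a name is spelled the way R expects
"firebrick" %in% all.colors

# Find every color name containing "blue"
grep("blue", all.colors, value = TRUE)
```

If the %in% check comes back FALSE, the name is probably misspelled, and using it in col would give you an error.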

Here is the example of using the col (an abbreviation for color) argument to utilize the “firebrick” color:

hist(example.data$income.20,
     breaks = 12,
     col = "firebrick",
     xlab = "Proportion of People Earning Below $20,000", 
     main = "Histogram of People Living in Poverty in 2015")

Now that the graph has appropriate labels and a nice format, we can finally draw some conclusions from the data. It appears that, for the year 2015, about 12 states had about 15% of their population earning below $20,000. In fact, it also appears that at least one state had nearly 30% of its people living below that line! This histogram makes it easy to see the poverty rates among states in the U.S.


Scatter Plots

While histograms are a great way to visualize the shape of a single variable, scatter plots are an excellent tool to help you discover relationships between variables. For example, say you wanted to know how one variable (x) relates to another variable (y). How would you go about analyzing that relationship in R? It’s quite simple actually! As I mentioned previously, R has very helpful built-in functions to help you analyze your data, and plot() is another one of those useful functions. This function requires two arguments to operate: x and y. Within this function you’ll want to define what you want to plot along the x-axis, and define what you’d like to plot along the y-axis.

In the following examples I’m going to examine the relationship between those that earn under $20,000 per year and those that did not finish their high school education. In this example, the proportion of the population earning under $20,000 per year is our dependent variable (y-axis) and the proportion of the population who did not finish high school is our independent variable (x-axis). We will still use example.data$income.20 from the previous examples as our dependent variable, but we’ll be using example.data$highschool.drop as our new independent variable.

#Plotting the dependent and independent variables using plot()
plot(x = example.data$highschool.drop, y = example.data$income.20)

We now have a very basic scatter plot! Before we come to any conclusions about what our data is showing us, let’s make the graph a bit nicer and add some labels. Changing the main title, x-axis label, and y-axis label can all be done in the same way that we changed them for the histogram.

Changing Label of X-Axis

First, let’s change the label of the x-axis.

#Changing label of x-axis from the default (example.data$highschool.drop) to "Proportion of People Who Didn't Graduate High School"
plot(x = example.data$highschool.drop, y = example.data$income.20, 
     xlab = "Proportion of People Who Didn't Graduate High School")

Changing Label of Y-Axis

Second, let’s change the label of the y-axis.

#Changing label of y-axis from the default (example.data$income.20) to "Proportion of People Earning Below $20,000"
plot(x = example.data$highschool.drop, y = example.data$income.20, 
     xlab = "Proportion of People Who Didn't Graduate High School", 
     ylab = "Proportion of People Earning Below $20,000")

Adding a Main Title

Finally, let’s add a title to the graph.

#Creating a title for the graph
plot(x = example.data$highschool.drop, y = example.data$income.20, 
     xlab = "Proportion of People Who Didn't Graduate High School", 
     ylab = "Proportion of People Earning Below $20,000",
     main = "Relationship Between High School Graduation and Poverty")

This graph now has appropriate labels and it’s much easier to understand!

Changing Plot Symbols

There are many arguments you can add to the plot() function to change what it outputs. For example, let’s say you didn’t want to use the default empty circles as points on your plot but instead wanted filled-in circles. To change this, you must use the pch argument (an abbreviation for “point character”) within the plot() function. The picture below shows which number you must assign to the pch argument to get each symbol. For a filled-in circle, you assign 19 to pch.

Image source: statmethods.net
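You can also draw a symbol chart like this yourself. The sketch below is one simple way to do it (the layout is my own choice, not a copy of the picture): it plots one point for each pch value from 0 to 25, so each symbol appears above its own number on the x-axis.

```r
# Draw one point for each symbol number from 0 to 25
symbol.numbers <- 0:25
plot(x = symbol.numbers, y = rep(1, length(symbol.numbers)),
     pch = symbol.numbers,
     yaxt = "n",            # hide the y-axis, which carries no information here
     xlab = "Value of pch",
     ylab = "",
     main = "Plot Symbols by pch Number")
```

Notice that pch accepts a whole vector of numbers, one per point, just as well as a single number.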

Here is how it looks when applied:

#Changing symbol from default empty circle to filled in circle
plot(x = example.data$highschool.drop, y = example.data$income.20, 
     xlab = "Proportion of People Who Didn't Graduate High School", 
     ylab = "Proportion of People Earning Below $20,000",
     main = "Relationship Between High School Graduation and Poverty",
     pch = 19)

Changing Plot Colors

Changing the colors of the points on the graph is done in the exact same way that you’d change the colors of the bins on a histogram: using the col argument. As before, you can see all of the named colors available in R by running colors() in the console.

For this example, we’ll change the color to “deepskyblue”:

#Changing color from default black to deepskyblue
plot(x = example.data$highschool.drop, y = example.data$income.20, 
     xlab = "Proportion of People Who Didn't Graduate High School", 
     ylab = "Proportion of People Earning Below $20,000",
     main = "Relationship Between High School Graduation and Poverty",
     pch = 19,
     col = "deepskyblue")

Changing the Limits on the Axes

When you create a graph using plot(), R automatically selects a range of values to be placed on the x- and y-axes based on the data it’s graphing. In the above examples, R automatically picked 0.06 to 0.12 (6% to 12%) for the x-axis because all of the values fit in that range. But let’s say you wanted to change this range for some reason. For example, let’s say you wanted the x-axis to start at 0% and go all the way to 20%. An easy way to change this is through the use of the xlim argument.

To do this, you must place your desired values in a vector that can be read by R. To place values within a vector in R, we use the c() function. This is a very generic function that just combines its elements into a vector. If you wanted to use 0% and 20% as the outer limits to the x-axis, you must put them in a vector using their actual values, 0.0 and 0.2.

Here is an example of changing the limits on the x-axis:

#Changing the limits on the x-axis
plot(x = example.data$highschool.drop, y = example.data$income.20, 
     xlab = "Proportion of People Who Didn't Graduate High School", 
     ylab = "Proportion of People Earning Below $20,000",
     main = "Relationship Between High School Graduation and Poverty",
     pch = 19,
     col = "deepskyblue",
     xlim = c(0.0, 0.2))

Now there is a lot more room on this graph! If we also wanted to change the limits of the y-axis so that it runs from 10% to 30%, we’d use the same method but a different argument! This time we’d use ylim.

Below is an example of just that (remember to place these values within c()!):

#Changing the limits on the y-axis
plot(x = example.data$highschool.drop, y = example.data$income.20, 
     xlab = "Proportion of People Who Didn't Graduate High School", 
     ylab = "Proportion of People Earning Below $20,000",
     main = "Relationship Between High School Graduation and Poverty",
     pch = 19,
     col = "deepskyblue",
     xlim = c(0, 0.2),
     ylim = c(0.1, 0.3))

These new axis ranges make the strong positive relationship between the independent and dependent variables even clearer; the original choices were “zoomed in” a little more, which could be helpful if you were interested in the exact values the x- and y-variables take. There are always trade-offs in data visualization!
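Before choosing limits, it can help to check what values your data actually take. Here is a minimal sketch with simulated proportions (sim.dropout and sim.income are invented stand-ins, not the real variables): range() reports the smallest and largest value, and R’s default axes extend just slightly beyond that range.

```r
# Simulated proportions standing in for the two variables (not real data)
set.seed(2)
sim.dropout <- runif(51, min = 0.06, max = 0.12)
sim.income  <- runif(51, min = 0.12, max = 0.28)

# range() returns the smallest and largest value of each variable
range(sim.dropout)
range(sim.income)

# Limits are just vectors of length two: c(lower, upper)
x.limits <- c(0, 0.2)
y.limits <- c(0.1, 0.3)

plot(x = sim.dropout, y = sim.income,
     xlim = x.limits,
     ylim = y.limits)
```

If your limits are narrower than the range of the data, some points will simply not appear on the plot, so it is worth comparing the two before you settle on values.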

Changing Point Size

It is also possible to change the size of the points on your graph. To do this, we need yet another argument: cex. The default value assigned to cex is 1, and any other number scales the points relative to that default. If we applied 2, for example, the points on the plot would be two times bigger. We can even make the points smaller by using a number below 1! Applying 0.5 to cex would make the points half as big as the default size.

Here is an example of doubling the size of the points (applying 2 to cex):

#Doubling the size of the points on the graph
plot(x = example.data$highschool.drop, y = example.data$income.20, 
     xlab = "Proportion of People Who Didn't Graduate High School", 
     ylab = "Proportion of People Earning Below $20,000",
     main = "Relationship Between High School Graduation and Poverty",
     pch = 19,
     cex = 2,
     col = "deepskyblue",
     xlim = c(0, 0.2),
     ylim = c(0.1, 0.3))

While you can certainly edit your graph to your liking, make sure not to go overboard. The past few examples I gave were just demonstrations of what you can do with R, not necessarily what you ought to do. Always make sure to find the simplest, most appealing way to display your data!


Line Plots

Line plots are a great tool to help visualize time-series data. It becomes easy to see trends and patterns over time when all the data is connected by lines! Line plots are created in nearly the exact same manner as scatter plots. You use the same function, plot(), and can even use all of the same arguments (such as xlim, ylab, main, etc.). The one thing you’re adding to the plot() function to make it a line plot (instead of a scatter plot) is the type argument.

Placing type = “l” within the plot() function will produce a line plot. (The letter “l” in quotation marks is a lowercase L, which stands for line.) Below, I created an example to show how this is done. The data used in the graphs are the poverty rates in Pennsylvania from the year 2001 to 2015. These variables are labeled poverty.pa and year respectively, from the dataset poverty. Since you already know how to create a main title and x- and y-axis labels, I’ve already included them in the graph. The important part to notice in the example is the type argument.

#Reading in the data
poverty <- read.csv(file = "~/Dropbox/309/RHelp/poverty.csv")

#Plotting poverty rates in Pennsylvania from 2001 to 2015
plot(x = poverty$year, y = poverty$poverty.pa, 
     type = "l",      
     xlab = "Year",
     ylab = "Proportion of People Earning Below $20,000",
     main = "Poverty Rate in Pennsylvania from 2001 to 2015")

A simple line graph! Let’s keep going.

Changing Line Types

On top of being able to create a line plot using the plot() function, you can also change the type of the line you’re plotting with. To change line types, you must use the lty argument (which stands for Line TYpe). This argument changes the type of the line that is plotted based on the value that is assigned to it.

Here is a list of the possible line types in R, using lty:

Image source: sthda.com
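You can also draw a chart of the line types yourself. The sketch below is my own layout rather than a copy of the picture: it draws one horizontal line for each lty value from 1 to 6 and labels it with the name R uses for that line type.

```r
# The six built-in line types and their names
line.types <- 1:6
line.names <- c("solid", "dashed", "dotted", "dotdash", "longdash", "twodash")

# Start with an empty plot, then draw one horizontal line per line type
plot(x = c(0, 1), y = c(0, 7),
     type = "n",            # "n" sets up the axes without drawing any points
     xlab = "", ylab = "Value of lty",
     main = "Line Types by lty Number")
abline(h = line.types, lty = line.types)

# Label each line with its name, just above the line
text(x = 0.5, y = line.types + 0.3, labels = line.names)
```

The lty argument accepts either the number or the name in quotation marks, so lty = 2 and lty = "dashed" produce the same line.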

In the following example I apply 2 to lty to create a dashed line plot:

#Plotting poverty rates in Pennsylvania from 2001 to 2015
plot(x = poverty$year, y = poverty$poverty.pa, 
     type = "l",
     lty = 2,
     xlab = "Year",
     ylab = "Proportion of People Earning Below $20,000",
     main = "Poverty Rate in Pennsylvania from 2001 to 2015")

Changing Line Width

Changing the width of the line on the plot is also possible with the plot() function. To do this, we use the lwd argument (Line WiDth). In this argument, the line width that you assign is relative to the default (which is 1). So if you used the argument, lwd = 3, the line output would be three times as wide. In the following example I’m also going to change the line’s color to red using col to show you what it would look like.

Here it is in practice:

#Plotting poverty rates in Pennsylvania from 2001 to 2015
plot(x = poverty$year, y = poverty$poverty.pa, 
     type = "l",
     lty = 2,
     lwd = 3,
     col = "red",
     xlab = "Year",
     ylab = "Proportion of People Earning Below $20,000",
     main = "Poverty Rate in Pennsylvania from 2001 to 2015")

See how it got much thicker?

Adding Another Line to Your Graph

There will be many times when you want to plot more than one line on one graph. Maybe you want to know whether certain years were above or below the mean. To do this, we can use the abline() command with the h= argument to plot a horizontal line.

#Plotting poverty rates in Pennsylvania from 2001 to 2015
plot(x = poverty$year, y = poverty$poverty.pa, 
     type = "l",
     lty = 2,
     lwd = 3,
     col = "red",
     xlab = "Year",
     ylab = "Proportion of People Earning Below $20,000",
     main = "Poverty Rate in Pennsylvania from 2001 to 2015")
abline(h=mean(poverty$poverty.pa))

abline() takes other arguments too. You can make a vertical line with v=, or set a y-intercept with a= and a slope with b= (to remember these, you may recall your high school algebra teacher talking about lines of the form y = a + bx).

To illustrate this, let’s add a couple of lines to the plot and make them different colors and styles:

#Plotting poverty rates in Pennsylvania from 2001 to 2015
plot(x = poverty$year, y = poverty$poverty.pa, 
     type = "l",
     lty = 2,
     lwd = 3,
     col = "red",
     xlab = "Year",
     ylab = "Proportion of People Earning Below $20,000",
     main = "Poverty Rate in Pennsylvania from 2001 to 2015")
abline(h=mean(poverty$poverty.pa))
abline(v=2010,col="blue")
abline(a=-4.85,b=.0025,col="darkgray",lty=3)
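An intercept of -4.85 may look strange on a plot whose y-axis only covers small proportions, but remember that the x-axis is measured in years. Plugging the first and last years into y = a + bx shows where the sloped line sits: about 0.1525 in 2001 and 0.1875 in 2015. The variable names below are just ones I chose for this check.

```r
# Intercept and slope used in the abline() call above
a <- -4.85
b <- 0.0025

# Height of the line at the first and last years on the x-axis
years <- c(2001, 2015)
line.heights <- a + b * years
line.heights
```

Running the same arithmetic on your own a and b before calling abline() is a quick way to confirm the line will actually fall inside the plotted area.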

As another example, say you wanted not only the poverty rate of PA from 2001 to 2015, but also that of California. To do this we use a new function, lines(). Placing this function directly after your plot() function tells R that you want to add more lines to your graph. The arguments to this function are x, a common variable among the two lines (in this case it’s year), and y, the second variable you wish to plot.

In the example below I add another line for the poverty rate for the state of California from the year 2001 to 2015, poverty.ca (I’m also going to make it blue using col):

#Plotting poverty rates in Pennsylvania from 2001 to 2015
plot(x = poverty$year, y = poverty$poverty.pa, 
     type = "l",
     lty = 2,
     lwd = 3,
     col = "red",
     xlab = "Year",
     ylab = "Proportion of People Earning Below $20,000",
     main = "Poverty Rates from 2001 to 2015")

#Adding a second line for California
lines(x = poverty$year, y = poverty$poverty.ca, col = "blue")

We have two lines! But we have a problem. R only allotted enough room to plot Pennsylvania, so part of California’s line is cut off. Let’s fix this by changing the range of the y-axis using the ylim argument. Make sure to use this argument in the plot() function, since it will not work in lines().

#Changing the size of the y-axis
plot(x = poverty$year, y = poverty$poverty.pa, 
     type = "l",
     lty = 2,
     lwd = 3,
     col = "red",
     ylim = c(0.15,0.24),
     xlab = "Year",
     ylab = "Proportion of People Earning Below $20,000",
     main = "Poverty Rates from 2001 to 2015")

#Adding a second line for California
lines(x = poverty$year, y = poverty$poverty.ca, col = "blue")

Perfect! All of the data now fits in the plot.
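Instead of picking the limits by eye, you could let R work them out from both series at once. This is a sketch under made-up numbers (year, rate.pa, and rate.ca are invented for illustration and are not the real poverty data): range() over the combined values gives limits that are guaranteed to fit both lines.

```r
# Made-up poverty rates for two states (illustrative values, not real data)
year <- 2001:2005
rate.pa <- c(0.17, 0.18, 0.17, 0.16, 0.17)
rate.ca <- c(0.21, 0.22, 0.23, 0.22, 0.21)

# The limits must cover the smallest and largest value across BOTH series
y.limits <- range(c(rate.pa, rate.ca))
y.limits

plot(x = year, y = rate.pa,
     type = "l",
     col = "red",
     ylim = y.limits,
     xlab = "Year",
     ylab = "Proportion")
lines(x = year, y = rate.ca, col = "blue")
```

Hand-picked limits such as c(0.15, 0.24) often look tidier, but the computed version keeps working if the data change.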

Playing with Axis Labels

We can use the cex.axis, cex.main and cex.lab arguments to change the size (relative to a baseline of 1) of various elements of the plot. So, setting cex.main=1.5 makes the title of the plot 1.5x its normal size. This can be nice to help a plot stand out.

plot(x = poverty$year, y = poverty$poverty.pa, 
     type = "l",
     lty = 2,
     lwd = 3,
     col = "red",
     ylim = c(0.15,0.24),
     xlab = "Year",
     ylab = "Proportion of People Earning Below $20,000",
     main = "Poverty Rates from 2001 to 2015",
     cex.main=1.5)

#Adding a second line for California
lines(x = poverty$year, y = poverty$poverty.ca, col = "blue")

Let’s also change the size of the axis text to 1.5x its normal size:

plot(x = poverty$year, y = poverty$poverty.pa, 
     type = "l",
     lty = 2,
     lwd = 3,
     col = "red",
     ylim = c(0.15,0.24),
     xlab = "Year",
     ylab = "Proportion of People Earning Below $20,000",
     main = "Poverty Rates from 2001 to 2015",
     cex.main=1.5,
     cex.axis=1.5)

#Adding a second line for California
lines(x = poverty$year, y = poverty$poverty.ca, col = "blue")

That’s really big! Let’s make them a little smaller and change the axis labels to make them bigger, as well.

plot(x = poverty$year, y = poverty$poverty.pa, 
     type = "l",
     lty = 2,
     lwd = 3,
     col = "red",
     ylim = c(0.15,0.24),
     xlab = "Year",
     ylab = "Proportion of People Earning Below $20,000",
     main = "Poverty Rates from 2001 to 2015",
     cex.main=1.5,
     cex.axis=1.25,
     cex.lab=1.25)

#Adding a second line for California
lines(x = poverty$year, y = poverty$poverty.ca, col = "blue")

Now our y-axis label is very big. You can incorporate a line break in a label by typing \n in the text. Let’s add a line break to the y-axis label. To illustrate, I’m going to also shrink the axis labels. This new text size is too small to be readable, but it will be useful as an illustration:

plot(x = poverty$year, y = poverty$poverty.pa, 
     type = "l",
     lty = 2,
     lwd = 3,
     col = "red",
     ylim = c(0.15,0.24),
     xlab = "Year",
     ylab = "Proportion of People\nEarning Below $20,000",
     main = "Poverty Rates from 2001 to 2015",
     cex.main=1.5,
     cex.axis=1.25,
     cex.lab=.5)

#Adding a second line for California
lines(x = poverty$year, y = poverty$poverty.ca, col = "blue")

Adding a Legend

Now that we’ve fixed the margins of this graph, we need to differentiate between the two lines. The thicker dashed line stands for Pennsylvania while the thinner solid line stands for California. To distinguish the two from each other, we need to use a legend. Legends give you more detail about what you’re looking at and generally appear on the graph itself.

To place a legend on this graph, we have to use the legend() function. This must be placed directly after the previous two functions. In order to run, legend() needs some basic arguments. The first tells R where to place the legend on the plot. To do this you simply give R one of the following keywords: “bottomright”, “bottom”, “bottomleft”, “left”, “topleft”, “top”, “topright”, “right” and “center”. This keyword does not need an argument name so long as it is placed first within the function and is within quotation marks.

Second, you’ll want to indicate the names of the lines by applying them to the legend argument. Yes, the legend argument within the legend() function (I know it can be confusing). To do this, use the c() function to create a vector of the names in the order that you wish them to be displayed. If you want California to be first, and Pennsylvania to be second, it would look like this: legend = c(“California”, “Pennsylvania”). Always remember to place text within quotation marks.

Third, you’ll want to indicate to R what symbols should be appearing within the legend (it cannot automatically figure them out). To do this, we’ll tell R what the line type and thickness are. These arguments, if you remember, were lty and lwd respectively. An lty of 1 means regular line, and an lty of 2 means dashed. Similarly, an lwd of 1 means regular thickness, and an lwd of 3 means triple thickness.

To get R to read these in, make sure to place them in a vector in accordance with how you placed the names in the vector. If California is first, then it will have the attributes of 1 for lty and 1 for lwd in the first spot in each vector. Pennsylvania will have 2 for lty and 3 for lwd in the second spot of each vector (this is because it is being displayed after California, and these are the values we gave its line in plot()).

Finally, we’ll tell R what color to make the symbols in the legend using col. Make sure to place these in a vector as well in the order you have already established!

Remember to use the c() function again to place all of these arguments in vectors. Ultimately it would look like this: legend(“topright”, legend = c(“California”, “Pennsylvania”), lty = c(1, 2), lwd = c(1, 3), col = c(“blue”, “red”)).

Here it is in practice:

#Plotting Pennsylvania
plot(x = poverty$year, y = poverty$poverty.pa, 
     type = "l",
     lty = 2,
     lwd = 3,
     col = "red",
     ylim = c(0.15,0.24),
     xlab = "Year",
     ylab = "Proportion of People Earning Below $20,000",
     main = "Poverty Rates from 2001 to 2015")

#Adding a second line for California
lines(x = poverty$year, y = poverty$poverty.ca, col = "blue")

#Adding a legend
legend("topright", legend = c("California", "Pennsylvania"), lty = c(1, 2), lwd = c(1, 3), col = c("blue", "red"))

We finally finished a two-line plot! This plot is easy to read because there are clearly two different lines that stand for two different states. From this plot it appears that, while these states generally follow the same trend, California has had more of its population in poverty than Pennsylvania.


Bar Plots

Now that we’ve covered histograms, scatter plots, and line plots, there’s one more basic plot to learn: barplots! Barplots are extremely useful in examining categorical data (data represented by groups). The data that I will be using for this example are the states by geographic region from the example.data dataset. The variable is coded from 1 to 4, with each number representing a different region of the country. 1 represents the South, 2 represents the Northeast, 3 represents the Midwest, and 4 represents the West. The states were coded following U.S. Census Bureau definitions of each region.

When making barplots, use the barplot() function! It only requires one argument to run: the categorical variable you wish to plot. Let’s test it out to see what happens.

#Creating a barplot of the number of states that reside in each region in the US
barplot(example.data$region)

This is not what a barplot is supposed to look like! We only want 4 bars! The reason we’re running into trouble here is that R doesn’t know how to count up all of the states residing within each group. Instead, it drew one bar for each individual state, with the state’s region code as the height, which doesn’t make much sense. To fix this we must use the table() function.

This function takes in data and builds a contingency table of the counts of each factor level. This means it adds up all of the states in the South (1), all of the states in the Northeast (2), and so on. We should now be able to graph our plot!

#Creating a contingency table of the counts of states per region
counts <- table(example.data$region)

#Let's look at our table
counts
## 
##  1  2  3  4 
## 16  9 12 13

#Creating a barplot of the number of states that reside in each region in the US
barplot(counts)
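To see exactly what table() is doing, here is a small sketch with a handful of made-up region codes (toy.region is invented for illustration and is not the real region variable). The counts come back in sorted order of the codes, and that is also the left-to-right order of the bars.

```r
# Made-up region codes for seven states (not the real data)
toy.region <- c(1, 1, 2, 3, 3, 3, 4)

# table() counts how many times each code appears
toy.counts <- table(toy.region)
toy.counts

# The names are the codes in sorted order, which is the order of the bars
names(toy.counts)

barplot(toy.counts)
```

Here the code 3 appears three times, so the third bar is the tallest.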

Perfect! Now to make this graph more understandable we should add labels. Adding the main title, x-axis label, and y-axis label is all done exactly the same way that you would do it using the plot() function to make a scatter or line plot. Simply use the main, xlab, and ylab arguments. Also, the first bar extends past the top of the y-axis, so let’s extend the y-axis using the ylim argument.

#Creating a contingency table of the counts of states per region
counts <- table(example.data$region)

#Creating a barplot of the number of states that reside in each region in the US
barplot(counts,
        main = "Number of States Per Region in the U.S.",
        xlab = "Region of the Country",
        ylab = "Frequency",
        ylim = c(0,20))

This looks much better. However, any average onlooker would have no idea what the numbers 1-4 designate. It is generally good practice to label the bars on a plot such as this. To do this, we will use the names.arg argument. This argument takes in as many labels as there are values on the plot. To apply values to this argument, we have to create a vector of labels using the c() function.

For example, the names of the regions in this plot would be placed as names.arg = c(“South”, “Northeast”, “Midwest”, “West”). It is very important that you place these labels in order, just as the variable is categorized, otherwise you will mislabel one of the categories.

Here it is in practice:

#Creating a contingency table of the counts of states per region
counts <- table(example.data$region)

#Creating a barplot of the number of states that reside in each region in the US
barplot(counts,
        main = "Number of States Per Region in the U.S.",
        xlab = "Region of the Country",
        ylab = "Frequency",
        ylim = c(0,20),
        names.arg = c("South", "Northeast", "Midwest", "West"))

Adding Colors to Barplots

Just as you can add color to all of the previous graphs using the col argument, you can do the same for barplots.

Here is an example using red:

#Creating a contingency table of the counts of states per region
counts <- table(example.data$region)

#Creating a barplot of the number of states that reside in each region in the US, and adding red as a color
barplot(counts,
        main = "Number of States Per Region in the U.S.",
        xlab = "Region of the Country",
        ylab = "Frequency",
        ylim = c(0,20),
        names.arg = c("South", "Northeast", "Midwest", "West"),
        col = "red")

Easy enough! But this doesn’t make much sense. It would be better if they were different colors reflecting the different regions of the country. This is also an easy fix. To do this we once again have to employ the use of the c() function to create a vector of colors.

To create this vector, pick a color for each region and put it in a vector in the order that they appear on the barplot. If you want blue for South, red for Northeast, yellow for Midwest, and green for West, they must be placed in the vector in that order. (Don’t forget the quotation marks!)

Here it is demonstrated:

#Creating a contingency table of the counts of states per region
counts <- table(example.data$region)

#Creating a barplot of the number of states that reside in each region in the US, as well as adding colors for each region
barplot(counts,
        main = "Number of States Per Region in the U.S.",
        xlab = "Region of the Country",
        ylab = "Frequency",
        ylim = c(0,20),
        names.arg = c("South", "Northeast", "Midwest", "West"),
        col = c("blue", "red", "yellow", "green"))

There you have it! A nearly perfect barplot (you should use prettier colors, though!).

Boxplots

R also makes boxplots easily. The syntax is a little different from a scatter plot or a line plot, though: we use “formula notation” in which a tilde (~, in the upper-left corner of most keyboards) separates the dependent variable and the independent variable. Let’s make a boxplot of high school dropout rate by region. All of the other plotting arguments we are used to work with boxplots as well. The only difference is that the argument for naming the x-axis labels is names rather than names.arg.

boxplot(example.data$highschool.drop~example.data$region, main="Dropout Rate by Region", 
    xlab="Region", 
    ylab="High School Dropout Rate",
    names = c("South", "Northeast", "Midwest", "West"))
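Formula notation can also be written with the data argument, which saves you from repeating the dataset name on both sides of the tilde. The sketch below uses a simulated data frame (sim.data and its values are invented and are not the real example.data) and otherwise mirrors the call above.

```r
# Simulated stand-in for example.data (values are invented)
set.seed(3)
sim.data <- data.frame(region = rep(1:4, each = 10),
                       highschool.drop = runif(40, min = 0.05, max = 0.15))

# Same formula as above, but the column names are looked up in `data`
b <- boxplot(highschool.drop ~ region, data = sim.data,
             main = "Dropout Rate by Region",
             xlab = "Region",
             ylab = "High School Dropout Rate",
             names = c("South", "Northeast", "Midwest", "West"))

# boxplot() also returns the five-number summaries it drew, one column per box
b$stats
```

Each column of b$stats holds the lower whisker, lower hinge, median, upper hinge, and upper whisker for one region.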


Helpsheet made by Jacob Ryan.